Abstract

The part-based method has been a fast-rising framework for object detection. It is attracting increasing attention for its
detection precision and partial robustness to occlusion. However, little research has focused on occlusions that overlap
the part regions, which can reduce the performance of the system. This paper proposes a part-based probabilistic model
and a corresponding inference algorithm for the part occlusion problem. The model is built entirely on Bayesian theory
and aims to be robust to large occlusions. In the model construction stage, all of the parts constitute the vertex set of a
fully connected graph, and a binary variable is assigned to each part to indicate its occlusion status. In addition, we
introduce a penalty term to regularize the argument space of the objective function. Thus, part detection is formulated
as an optimization problem, which is divided into two alternating procedures: the outer inference and the inner
inference. A stochastic tentative method is employed in the outer inference to determine the occlusion status of each
part. In the inner inference, the gradient descent algorithm is employed to find the optimal positions of the parts given
the current occlusion status. Experiments were carried out on the Caltech database. The results demonstrate that the
proposed method achieves strong robustness to occlusion.

Introduction

Object detection [1] is a classical problem in the field of
computer vision. Among the numerous methods for object
detection, the statistical-based approaches [2-13] have become
mainstream. They discriminate the given object from others by
learning it, hence achieving more robust detection. As a typical
statistical-based method, the part-based model has attracted
increasing attention in the past decade [5-11,14-19]. As the
name implies, a part is a local area of the object. Part-based
models capture both the local appearance and the spatial
structure of the object simultaneously, which makes
these methods robust, to some extent, to variations in object
pose and appearance. As with other statistical-based
methods, the part-based methods involve two fundamental
problems: training and detection. The former refers to
modeling the part appearance and the spatial relationship among
the parts. The latter refers to the optimization problem of
inferring information about the parts (such as their positions
and occlusion status). It is worth emphasizing that most
part-based methods pay attention to the part areas only. Therefore,
they are robust only to occlusions that do not overlap the
part areas (as illustrated in Figure 1(a)). However, if the occlusions
overlap the part areas, as illustrated in Figure 1(b), not only are
the occluded parts affected, but the non-occluded ones are
also shifted from their correct positions because of the spatial
relationship among the parts (refer to the experiments below for
more details).



Related work and our contributions 

In the past years, many part-based methods have emerged, such
as the bag model [6,7], constellation model [8,15], pictorial
structure model [9,16], star model [10,14], vocabulary-based
method [17,18] and k-fan model [11]. The spatial relationships
among their parts are illustrated in Figure 2.

The bag model considers almost no spatial
relationship among parts. The other part-based models improve
detection performance by adding spatial relationships
among the parts. In particular, the pictorial structure model and the
k-fan model perform fast detection via dynamic programming
[20,21], since the introduction of the generalized distance transform
(GDT) [22] greatly reduced the time complexity of the dynamic
programming. However, they do not allow cycles in the spatial
relationship [20]. There are many solutions to this problem
[23-27], of which the simplest is the gradient descent
algorithm (GD).

Although the above part-based models belong to the category of
sparse representation, they still suffer when the part areas are
shaded (as demonstrated in Figure 3). In this paper, these
shaded parts are named disabled parts. Some methods [19]
employ a part appearance representation that is robust
to occlusion, but the robustness is limited, especially when the
occlusion region is large. In the literature on occlusion,
removing the occluded parts from the model is the most intuitive
way to solve the occlusion problem [8,14,28,29]. In general, this can
be achieved by estimating a mask for the test image. However, the
mask variable is difficult to model in the objective function in a way
that achieves precise inference, and it cannot work in the case of large
occlusion. Papandreou et al. proposed to solve the occlusion
problem by using a robust objective function [30], which weakens
the role of the occluded parts, but it is difficult to completely eliminate
the influence of the occluded areas. Li et al. [31] solved the
occlusion problem under a novel RANSAC framework. However,
it is difficult to incorporate the spatial relationship into that framework
to improve the detection of the keypoints. In [32,33], the authors solved the
occlusion problem under the sparse framework, which is usually
used in the case of batch image processing.

Figure 1. Two kinds of occlusions faced by the part-based
model. (a) Occlusion that does not overlap the part areas. (b)
Occlusion overlapping the part areas.
doi:10.1371/journal.pone.0084624.g001

This paper applies the part-based method to object
detection, with special emphasis on occlusion handling, and
proposes a part-based probabilistic model with an alternating
detection scheme. To increase the detection accuracy, we
constructed a fully connected graph to describe the spatial
relationship among parts in the training stage. Each edge is
represented by a 2D Gaussian distribution over the vector difference
of the position coordinates. Moreover, we introduced a penalty
term into the objective function to obtain a more accurate
detection result. For the occlusion problem, we assigned a binary
status variable to each part to indicate whether it is occluded or
not, and proposed a method to model the prior probability of the
occlusion status variable. Then, according to Bayesian theory,
we constructed a new posterior probability as the objective
function. In the detection stage, we designed two alternating
procedures, named the inner inference and the outer
inference. The former uses GD to determine the positions of
the parts given the current occlusion status variable. The latter
is responsible for determining the occlusion status of the
parts according to their current positions. To perform the detection
efficiently, we adopted a stochastic tentative method in the outer
inference. In addition, we incorporated a validity test mechanism
into the detection procedure to reject invalid inner
inference results.

Methods 

Consider a model with $n$ parts $V = \{v^1, v^2, \ldots, v^n\}$. A detection
result for a given image is expressed as $H = \{L, S\}$. The argument
$L = \{l_1, l_2, \ldots, l_n\}$ is the position variable, where $l_i = \{y_i, x_i\}$
denotes the position of part $v^i$. The argument $S = \{s_1, s_2, \ldots, s_n\}$
represents the occlusion status variable, where $s_i$ is a Boolean
variable ($s_i = 1$ if part $v^i$ is a normal part; $s_i = 0$ if it is a
disabled part). We define the objective function as

$$p(H \mid I) = p(I \mid L, S)\, p(L \mid S)\, p(S)\, \exp(Q(H)) / p(I), \tag{1}$$

which is the posterior probability of a result $H$ given a test image
$I$, where $\exp(Q(H))$ is a penalty term and $Q(H)$ is defined in the
following subsection.

Construction of the model 

In Eq. 1, the posterior probability contains four terms: (1)
$p(I \mid H) = p(I \mid L, S)$, the total matching probability
of the normal parts; (2) $p(L \mid S)$, the prior probability of the
spatial relationship among the normal parts; (3) $p(S)$, the prior
probability of the occlusion status; and (4) the penalty term $\exp(Q(H))$.

The total matching probability of all of the normal parts is

$$p(I \mid H) = C_0 \prod_{v^i \in V} g_i(I, l_i)^{s_i}, \tag{2}$$

where $C_0$ is a constant for a test image $I$, and $g_i(I, l_i)$ is the matching
probability of a single part [11].

For the prior probability of the spatial relationship $p(L \mid S)$, we
employed a fully connected graph to represent the spatial
relationship among the parts. However, as demonstrated in
Figure 3(c), the disabled parts severely affect the detection
of the normal parts because of the edges between them. We
overcame this problem by discarding the edges connected to the
disabled parts. In addition, the positions of the disabled parts were
assumed to follow independent uniform distributions [8].
Therefore,

$$p(L \mid S) = \Big[ \prod_{(v^i, v^j) \in E^*} p(l_i \mid l_j)^{s_i s_j} \Big] \Big/ M^{n-A}, \tag{3}$$

where $A$ is the number of normal parts (those with $s_i = 1$), $M$ is a
constant representing the number of possible positions where a
part could be placed, $E^*$ is the edge set of the fully connected graph,
and $p(l_i \mid l_j)$ is defined as $p(l_i - l_j)$, which follows a 2D Gaussian
distribution.
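The log form of Eq. 3 can be sketched as follows. This is a minimal illustration, assuming each edge stores a mean displacement and a 2×2 covariance; the names `log_gauss2d`, `log_spatial_prior` and the layout of the `edges` dictionary are hypothetical and not taken from the paper.

```python
import math

def log_gauss2d(d, mean, cov):
    """Log density of a 2D Gaussian evaluated at displacement d = (dy, dx)."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    ry, rx = d[0] - mean[0], d[1] - mean[1]
    # Quadratic form r^T cov^{-1} r, using the closed-form 2x2 inverse.
    quad = (e * ry * ry - (b + c) * ry * rx + a * rx * rx) / det
    return -0.5 * quad - math.log(2.0 * math.pi) - 0.5 * math.log(det)

def log_spatial_prior(L, S, edges, M):
    """log p(L|S): Gaussian edges between normal parts, uniform 1/M per disabled part.

    L     -- list of (y, x) positions, one per part
    S     -- list of 0/1 occlusion status bits (1 = normal, 0 = disabled)
    edges -- dict {(i, j): (mean, cov)} over the fully connected graph (assumed layout)
    M     -- number of possible positions of a part
    """
    n_disabled = len(S) - sum(S)              # this is n - A in Eq. 3
    total = -n_disabled * math.log(M)
    for (i, j), (mean, cov) in edges.items():
        if S[i] == 1 and S[j] == 1:           # edges touching a disabled part are discarded
            d = (L[i][0] - L[j][0], L[i][1] - L[j][1])
            total += log_gauss2d(d, mean, cov)
    return total
```

With this layout, flipping a bit of `S` to 0 removes every edge term that touches the part and adds one `-log(M)` term in its place.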

Let $w^{(m,n)}$ denote the conditional probability of part $v^n$ being
shaded given that part $v^m$ is disabled. If the mean
distance $d(m,n)$ between $v^m$ and $v^n$ satisfies $d(m,n) <$ '^Co (Co is a
constant), we have proved that (please refer to Appendix S1 for
the derivation)



X log (iCo + ^JK-d^^^^^ /d^^,n)) ■ 
(4)

Then the prior probability $p(S)$ is calculated as

$$p(S) = \Big[ \prod_{(v^i, v^j) \in E^*} R(s_i, s_j) \Big] \Big/ U, \tag{5}$$

where

$$R(s_i, s_j) = \begin{cases}
1 - (1 - p_t)(2 - w^{(i,j)}), & s_i = 1 \text{ and } s_j = 1, \\
(1 - p_t)\, w^{(i,j)}, & s_i = 0 \text{ and } s_j = 0, \\
(1 - p_t)(1 - w^{(i,j)}), & \text{otherwise},
\end{cases} \tag{6}$$

represents the joint probability of the occlusion status of the
part pair $(v^i, v^j)$. Here $p_t$ is a constant standing for the probability
of a part being present, and $U$ is a normalization constant.
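A quick numerical check of the pairwise term in Eq. 6 is sketched below. It assumes the mixed case (one part normal, one disabled) equals (1 - p_t)(1 - w), which is the value that makes the four joint probabilities sum to one; the function names and the example values of `pt` and `w` are illustrative only.

```python
import math

def pair_prob(si, sj, pt, w):
    """Joint probability R(s_i, s_j) of the occlusion status of one part pair."""
    if si == 1 and sj == 1:
        return 1.0 - (1.0 - pt) * (2.0 - w)
    if si == 0 and sj == 0:
        return (1.0 - pt) * w
    # Assumed form of the mixed case: exactly one of the two parts is disabled.
    return (1.0 - pt) * (1.0 - w)

def log_status_prior_unnormalized(S, w, pt):
    """Sum of log R(s_i, s_j) over the edges in w; the constant log U is omitted."""
    return sum(math.log(pair_prob(S[i], S[j], pt, wij)) for (i, j), wij in w.items())

# The four cases (1,1), (0,0), (0,1), (1,0) should form a probability distribution.
total = sum(pair_prob(a, b, 0.9, 0.6) for a in (0, 1) for b in (0, 1))
```

Note that R(1, 1) is positive only when (1 - p_t)(2 - w) < 1, so `pt` has to be reasonably close to 1 for the logarithm to be defined.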
Figure 2. Spatial relationship among parts. (a) Bag model. (b) Constellation model. (c) Pictorial model. (d) k-fan model (k = 1, 2, 3 from left to
right).
doi:10.1371/journal.pone.0084624.g002



The penalty term $Q(H)$ aims to improve the detection results
by emphasizing the weak parts, and also to regularize the
argument space of the objective function. It is defined as

$$Q(L, S) = \min_{v^i \in V_S} \big( W_i(L, S) \big)
= \min_{v^i \in V_S} \Big( \big[ \log g_i(I, l_i) + \sum_{j \ne i} s_j \log p(l_i \mid l_j) \big] \Big/ \sum_{k} s_k \Big), \tag{7}$$

where $V_S$ is the set of parts whose occlusion status is 1.

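Substituting Eqs. 2, 3, 5 and 7 into Eq. 1, the posterior probability can
be rewritten as

$$p(H \mid I) = \frac{C \times \prod_{(v^i, v^j) \in E^*} R(s_i, s_j) \times \prod_{v^i \in V} g_i(I, l_i)^{s_i} \times \prod_{(v^i, v^j) \in E^*} p(l_i \mid l_j)^{s_i s_j} \times \exp(Q(H))}{M^{n-A}}, \tag{8}$$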

where $C = C_0 / [U \cdot p(I)]$ is a constant for a test image $I$. Taking
the negative logarithm of both sides of Eq. 8 and dropping the constant term $-\log C$, we have

$$E(H) = (n - A) \log(M) - \sum_{(v^i, v^j) \in E^*} \log(R(s_i, s_j)) - Q(H)
- \sum_{v^i \in V} s_i \log(g_i(I, l_i)) - \sum_{(v^i, v^j) \in E^*} s_i s_j \log(p(l_i \mid l_j)). \tag{9}$$

The above expression is the objective function used for detection.
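To make the structure of the energy in Eq. 9 concrete, the sketch below evaluates it term by term. All model components are passed in as callables; their names and signatures (`log_g`, `log_p`, `R`, `Q`) are assumptions made for illustration, not an interface defined in the paper.

```python
import math

def energy(L, S, edges, M, log_g, log_p, R, Q):
    """Evaluate E(H) of Eq. 9 for H = {L, S}.

    L, S   -- part positions and 0/1 occlusion status bits
    edges  -- iterable of index pairs (i, j) of the fully connected graph
    M      -- number of possible positions of a part
    log_g  -- hypothetical callable: log_g(i, l_i) -> log matching probability
    log_p  -- hypothetical callable: log_p(i, j, l_i, l_j) -> log edge probability
    R      -- hypothetical callable: R(i, j, s_i, s_j) -> pairwise status probability
    Q      -- hypothetical callable: Q(L, S) -> penalty term
    """
    edges = list(edges)
    n, A = len(S), sum(S)
    E = (n - A) * math.log(M)                                    # uniform prior of disabled parts
    E -= sum(math.log(R(i, j, S[i], S[j])) for i, j in edges)    # occlusion status prior
    E -= Q(L, S)                                                 # penalty term
    E -= sum(S[i] * log_g(i, L[i]) for i in range(n))            # appearance of normal parts
    E -= sum(S[i] * S[j] * log_p(i, j, L[i], L[j]) for i, j in edges)  # edges between normal parts
    return E
```

Because the appearance and edge terms are multiplied by the status bits, a disabled part contributes only through the `(n - A) * log(M)` term and the status prior.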

Detection 

In the detection step, we look for an optimal detection
result $H^* = \{L^*, S^*\}$ with minimum energy, that is,

$$H^* = \arg\min_{H} \big( E(H) \big). \tag{10}$$

In this paper, we adopted an alternating optimization strategy
to solve the above problem. The basic idea is to let $S$ move through the
space of occlusion status vectors (called the outer inference), and after each move of
$S$, the inner inference searches for the current optimal part positions
$L^{*S}$. These two procedures are performed alternately until the
termination conditions are satisfied.



Figure 3. Influence of the disabled parts on the detection of the normal ones. (a) The manual label of the face. (b) The eyes have been
occluded (the occlusion is represented by the dotted box). (c) Detection results in which the disabled parts degrade the detection of the
normal ones. (d) Detection results after the occluded parts are discarded from the model.
doi:10.1371/journal.pone.0084624.g003

Figure 4. Flow chart of the outer inference. Box labels include: obtain the initial positions of the parts; invert a bit of the
occlusion status variable; end.
doi:10.1371/journal.pone.0084624.g004

Inner inference and outer inference. In the inner
inference, given the current status vector $S = \{s_1, \ldots, s_n\}$, Eq. 9
can be expressed as

$$E_S(L) = - \sum_{v^i \in V} s_i \log(g_i(I, l_i)) - \sum_{(v^i, v^j) \in E^*} s_i s_j \log(p(l_i \mid l_j)) - Q(L, S) + C_1, \tag{11}$$

where $C_1$ is a constant. Eq. 11 is the objective function of the
inner inference. We used the gradient descent algorithm (GD) to
search for the current optimal position of each part,
$L^{*S} = \{l_1^{*S}, \ldots, l_n^{*S}\}$.
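The inner inference can be illustrated with a toy gradient descent. This sketch is not the authors' implementation: it assumes an isotropic Gaussian on every edge, replaces the appearance term with a quadratic surrogate that pulls each normal part toward a hypothetical appearance peak, and omits the penalty term Q. The names `energy_and_grad`, `inner_inference`, `peaks`, `k` and `sigma2` are all illustrative.

```python
def energy_and_grad(L, S, peaks, edges, k=1.0, sigma2=4.0):
    """Toy version of E_S(L) and its gradient with respect to every part position."""
    n = len(L)
    E = 0.0
    grad = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        if S[i]:                                   # appearance surrogate, normal parts only
            for c in range(2):
                r = L[i][c] - peaks[i][c]
                E += 0.5 * k * r * r
                grad[i][c] += k * r
    for (i, j), mu in edges.items():
        if S[i] and S[j]:                          # edges touching disabled parts are skipped
            for c in range(2):
                r = (L[i][c] - L[j][c]) - mu[c]    # deviation from the mean displacement
                E += 0.5 * r * r / sigma2
                grad[i][c] += r / sigma2
                grad[j][c] -= r / sigma2
    return E, grad

def inner_inference(L0, S, peaks, edges, step=0.2, iters=200):
    """Fixed-step gradient descent on the toy energy; disabled parts are never moved."""
    L = [list(p) for p in L0]
    for _ in range(iters):
        _, g = energy_and_grad(L, S, peaks, edges)
        for i in range(len(L)):
            for c in range(2):
                L[i][c] -= step * g[i][c]
    return L
```

Since the gradient of a disabled part is identically zero here, its position stays at the initial value, which matches the idea that disabled parts are placed afterwards from the spatial relationship alone.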

In the outer inference, given $L^{*S}$, the objective function Eq. 9
becomes a function of $S$ only:

$$E(S) = (n - A) \log(M) - \sum_{(v^i, v^j) \in E^*} \log(R(s_i, s_j)) - Q(L^{*S}, S)
- \sum_{v^i \in V} s_i \log(g_i(I, l_i^{*S})) - \sum_{(v^i, v^j) \in E^*} s_i s_j \log(p(l_i^{*S} \mid l_j^{*S})). \tag{12}$$

As mentioned above, the outer inference is responsible for
determining the occlusion status variable. At the beginning of
the outer inference, the occlusion status variable is assumed to be
$S = \{1, 1, \ldots, 1\}$, implying that no part is occluded.
The aim of the outer inference is to find the next probable $S$ that
reduces the value of Eq. 12.

Figure 5. Two-part sub-model.
doi:10.1371/journal.pone.0084624.g005

Because the space of $S$ is discrete, we adopted a stochastic
tentative method for the outer inference. In each iteration,
we calculated the gradient vector $h = G(S) = \{\partial E / \partial s_1, \ldots, \partial E / \partial s_n\}$
of Eq. 12. If $(s_j = 0 \;\&\; h(j) < 0) \mid (s_j = 1 \;\&\; h(j) > 0)$ holds, we consider
$s_j$ a feasible descending bit, since inverting it is expected to lower the energy. We consider the $s_j$ with
the minimal value of $|h(j)|$ the most irresolute bit. The
procedure of the outer inference is illustrated in Figure 4 and
detailed in Table 1.
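The bit selection rule described above can be sketched as follows. The function names are illustrative, and the gradient vector `h` is assumed to be supplied by the caller; how it is computed from Eq. 12 is outside this sketch.

```python
import random

def feasible_bits(S, h):
    """Indices j where inverting s_j is expected to lower the energy."""
    return [j for j, (s, g) in enumerate(zip(S, h))
            if (s == 0 and g < 0) or (s == 1 and g > 0)]

def most_irresolute_bit(h):
    """Index of the bit whose gradient has the smallest absolute value."""
    return min(range(len(h)), key=lambda j: abs(h[j]))

def choose_bit_to_invert(S, h, rng=random):
    """Pick the bit to invert, following the three cases of the stochastic tentative method."""
    cand = feasible_bits(S, h)
    if not cand:                       # no feasible descending bit
        return most_irresolute_bit(h)
    if len(cand) == 1:                 # exactly one feasible descending bit
        return cand[0]
    # Several candidates: sample one with probability proportional to |h(j)|.
    weights = [abs(h[j]) for j in cand]
    return rng.choices(cand, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when comparing runs.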

The validity test is used to check whether the inner inference
has obtained a feasible result. For a two-part model, if the likelihood

$$O_i(l_1, l_2) = g_1(I, l_1)\, g_2(I, l_2)\, p(l_2 \mid l_1) \tag{13}$$

is larger than some threshold, this two-part model is defined to
pass the validity test. A full-part model is defined to pass the
validity test if at least one of its two-part sub-models passes the
validity test. Steps 3g and 3h form the inner inference procedure,
which keeps the solution from deviating from the correct
occlusion status variable. Finally, after obtaining the
output, we can estimate the positions of the disabled parts from
the spatial relationship among all of the parts alone, i.e., by minimizing the
following expression:

$$K(L) = - \sum_{(v^i, v^j) \in E^*} \log(p(l_i \mid l_j)), \tag{14}$$

where $\{l_i \mid v^i \in V_{S^*}\}$ is the known variable, which has been obtained
in Algorithm 1 (Table 1).
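The validity test for a full-part model reduces to an "at least one pair passes" check, sketched below. The callable `pair_likelihood(i, j)` is a hypothetical stand-in for the two-part likelihood of Eq. 13 evaluated at the current positions, and the threshold value is left to the caller because the paper does not state it here.

```python
from itertools import combinations

def passes_validity_test(n_parts, pair_likelihood, threshold):
    """True if at least one two-part sub-model has a likelihood above the threshold."""
    return any(pair_likelihood(i, j) > threshold
               for i, j in combinations(range(n_parts), 2))
```

For six parts this checks at most 15 part pairs, so the test is cheap compared with the gradient descent it guards.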

Table 1. Algorithm 1: Outer inference.

Input: The initial occlusion status variable $S = \{1, 1, \ldots, 1\}$, the initial energy value $O_t = \infty$;

Step 1. Use the simulated annealing algorithm to obtain the initial position, then obtain the current optimal position $L^{*S}$ via the inner inference. If it passes the validity test
(explained in the text), go to Step 3; otherwise go to Step 2;

Step 2. Use all of the two-part sub-models (illustrated in Figure 5) to make an inner inference (determine the positions of these two parts), until a two-part sub-model T
whose results pass the validity test appears. Then, estimate the approximate positions of the parts other than the two parts in T, and refine their positions by GD.
If none of the two-part sub-models passes the validity test, quit the outer inference in failure;

Step 3. While the result $S^*$ is not altered and the maximum iteration number m is not reached,

(a) Calculate the gradient vector $h$;

(b) If $E(S) \le O_t$, $S^*$ is updated to $S$, and $O_t$ is updated to $E(S)$;

(c) In $S$, search for the feasible descending bits;

(d) If there is no feasible descending bit, invert the most irresolute bit in $S$, and go to Step 3g; otherwise go to Step 3e;

(e) If there is only one feasible descending bit, invert it, and go to Step 3g; otherwise go to Step 3f;

(f) If there are at least two feasible descending bits, invert one of them, chosen with a probability proportional to the absolute value of its gradient;

(g) Carry out the GD algorithm for $E_S(L)$; if the results cannot pass the validity test, go to Step 3h;

(h) Randomly choose a different bit to invert, and go to Step 3g;
End While;

Output: The status result $S^*$ and $\{l_i^* \mid v^i \in V_{S^*}\}$.

Results and Discussion

In this section, we tested the performance of our method on the
Faces dataset in the Caltech database [8,34] and compared it with
that of the 1-fan model [11]. For this dataset, as done in [11], six
parts were selected: the left eye, the right eye, the nose, the left corner
of the mouth, the right corner of the mouth and the chin (defined
as parts 1, 2, ..., 6, respectively). In our experiment, the distance
error e is defined as the mean distance between the n detected part
positions and their corresponding ground truth. The smaller e
is, the better the given model performs on this specific test image.
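The distance error e of one test image can be computed as in the sketch below, assuming positions are given as (y, x) pixel pairs and the distance is Euclidean; the function name is illustrative.

```python
import math

def distance_error(detected, truth):
    """Mean Euclidean distance (in pixels) between detected and ground-truth part positions."""
    dists = [math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(detected, truth)]
    return sum(dists) / len(dists)
```

For example, two parts with one exact hit and one detection that is 3 pixels off vertically and 4 pixels off horizontally give e = (0 + 5) / 2 = 2.5 pixels.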

We first demonstrated the influence of the disabled parts on the
normal ones. We chose 200 images from the Faces dataset to train
a 1-fan model, and chose 100 test images to construct a test dataset
by shading the right eye and the right corner of the mouth in each
test image. Then, we compared the discarded 1-fan model (in which the
disabled parts had been discarded from the 1-fan model; shown in
Figure 6(b)) with the original 1-fan model (shown in Figure 6(a)).
We used the distance error e as the evaluation index. The distance
errors of all of the test images were normalized to [0,1]. The
comparison result is illustrated in Figure 7, where N(e) is the
distribution function of the distance error. The horizontal axis
represents the normalized distance error e, and the vertical axis represents the percentage
of test images whose distance error is smaller than e. The higher
the curve is, the better the model performs. It
should be emphasized that only the normal parts were used to
calculate the distance error e. Figure 7 shows that the detection
results of the discarded 1-fan model were much better than those of the
original 1-fan model, because the disabled parts had been discarded. In other
words, the disabled parts severely affect the detection of the
normal parts if they are not handled properly. Once occlusion
happens, the matching degree of the disabled parts is very likely to
be low at the correct positions, so they move to other positions to
minimize the objective function, which increases the
deformation of the edges connecting them. As a result, the adjacent
normal parts adjust their positions to reduce the edge cost (as
illustrated in Figure 3(c)). For this reason, we discarded the
disabled parts in our method. The performance will be demonstrated
in the following experiments.

Figure 6. 1-fan model and discarded 1-fan model. (a) Parts shown: left eye, right eye, left corner of mouth, nose, right corner of
mouth, chin. (b) Parts shown: left corner of mouth, nose, right corner of mouth, chin.
doi:10.1371/journal.pone.0084624.g006

Figure 7. Influence of disabled parts on the Faces dataset.
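The curve N(e) described above is an empirical distribution function of the normalized distance errors. The sketch below assumes that normalization means dividing every error by the largest one, which the text does not specify; the function names are illustrative.

```python
def normalize_errors(errors):
    """Scale distance errors into [0, 1] by dividing by the maximum (assumed scheme)."""
    top = max(errors)
    return [e / top for e in errors]

def error_distribution(norm_errors, e):
    """N(e): fraction of test images whose normalized distance error is smaller than e."""
    return sum(1 for x in norm_errors if x < e) / len(norm_errors)
```

Evaluating `error_distribution` on a grid of thresholds between 0 and 1 gives one curve per model, and a curve that lies higher indicates that more test images fall below each error level.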

In the second experiment (the partially occluded experiment), we
compared the proposed model with the 1-fan model and
evaluated the detection accuracy of our method when one
or two parts were partially occluded. Both models were trained on
200 images selected randomly from the Faces dataset. We
randomly selected another 100 images to construct the two test
datasets. The first dataset, termed the one-part-shaded test
dataset, was constructed by shading part(1), part(2), ..., part(6)
in turn with different occlusion degrees for each test image.
The occlusion degree is defined as the ratio of the occlusion area
to the part area; it varied over 11 values from about 44% (the size of the occlusion
region was 40 × 40) to 100% (the size of the occlusion region was
60 × 60). The number of test images in the one-part-shaded
dataset was 100 × 6 × 11 = 6600. The second dataset,
termed the two-part-shaded test dataset, was constructed by
shading 7 kinds of adjacent part pairs (i.e., part(1,2), part(1,3),
part(1,4), part(4,6), part(2,3), part(2,5), part(5,6)) in turn with
different occlusion degrees for each test image. The number of
images in the second dataset is 100 × 7 × 11 = 7700. Sample test
images from the two-part-shaded test dataset are shown in
Figure 8. The distance error e was again used as the evaluation
index for detection accuracy. The average distance errors over all of
the test images are plotted in Figure 9. As a typical instance, the
results on the part(1,2)-shaded test are listed in Table 2.
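The occlusion degrees and dataset sizes can be reproduced with a few lines of arithmetic. The sketch assumes a 60 × 60 part region (implied by the 100% case) and square occlusion regions whose side grows from 40 to 60 pixels in steps of 2; the step size is an assumption that yields 11 values.

```python
PART_SIDE = 60                      # side of the part region in pixels (from the 100% case)
sides = list(range(40, 61, 2))      # assumed occlusion sides: 40, 42, ..., 60

# Occlusion degree = occlusion area / part area, in percent.
degrees = [round(100.0 * s * s / PART_SIDE ** 2, 2) for s in sides]

n_images, n_single, n_pairs = 100, 6, 7
one_part_total = n_images * n_single * len(degrees)   # one-part-shaded dataset size
two_part_total = n_images * n_pairs * len(degrees)    # two-part-shaded dataset size
```

Under this assumption the last seven degrees (64.00%, 69.44%, 75.11%, 81.00%, 87.11%, 93.44%, 100.00%) and the 49.00% entry coincide with the values listed in Table 2, while three of the first four differ slightly from the 44.00%, 54.00% and 59.00% reported there, so the smallest occlusion sizes actually used may not follow the even 2-pixel step.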

Figure 9 shows that the average distance error of our model is
almost constant and much smaller than that of the 1-fan model
as the occlusion degree changes from 44% to 100%. This is
because the disabled parts were discarded
from our model and could not affect the detection of the normal
parts. For the 1-fan model, the average distance error on the
one-part-shaded dataset was smaller than that on the two-part-shaded
dataset. Moreover, the average distance error of the 1-fan
model increases with the occlusion degree.
Specifically, Table 2 shows that once the occlusion degree exceeded
70%, the average distance error increased sharply. That is because
part(1,2) (see Figure 8(a)) holds more information about the face
than the other parts, so the occlusion of part(1,2) greatly
misguides the 1-fan model.

Figure 8. Sample images in the two-part-shaded test dataset. (a) An image with part(1,2) shaded (the occlusion degree is 81%). (b) An
image with part(4,6) shaded (the occlusion degree is 64%). The purple solid boxes represent the part regions and the black dotted boxes
represent the occlusion regions.
doi:10.1371/journal.pone.0084624.g008

Figure 9. Average distance error of the partially occluded experiment.
doi:10.1371/journal.pone.0084624.g009

Table 2. Average e of the 1-fan model and our method when
part(1,2) was shaded.

Occlusion degree of each part   Average e of 1-fan (pixel)   Average e of our model (pixel)
44.00%                          6.6096                       2.6243
49.00%                          6.9762                       2.6026
54.00%                          7.3334                       2.5952
59.00%                          7.7257                       2.6025
64.00%                          8.1099                       2.5939
69.44%                          8.7296                       2.5946
75.11%                          11.6345                      2.6098
81.00%                          11.9330                      2.6090
87.11%                          14.0736                      2.6091
93.44%                          16.0132                      2.6021
100.00%                         18.3148                      2.5845
doi:10.1371/journal.pone.0084624.t002

PLOS ONE | www.plosone.org

To further evaluate the performance of our method, we
constructed four test datasets by completely shading one, two,
three, or four parts (named the completely shading experiment).
We ran our algorithm on these test datasets. To evaluate
the occlusion status variable S, we used two evaluation indices: the
occlusion false alarm probability $p_f$ and the occlusion false
dismissal probability $p_d$, computed over all of the bits in the occlusion status
variable S. We defined the occlusion false alarm probability $p_f$ as
the probability that a bit in S was wrongly estimated as 0 when it
was actually 1. We defined the occlusion false dismissal probability
$p_d$ as the probability that a bit in S was wrongly estimated as 1 when it
was actually 0. To evaluate the position variable L, we again used
the distance error e as the evaluation index. The results of the
complete shading experiment are shown in Table 3. We did not
compare our method with the 1-fan model in this experiment
because the 1-fan model almost cannot work when
three or more parts are shaded.
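The two indices for the occlusion status variable can be computed as sketched below. The sketch assumes both probabilities are taken over all bits of all test images, which is one reading of "for all of the bits"; the function name is illustrative.

```python
def occlusion_error_rates(estimated, truth):
    """Return (p_f, p_d) over all status bits of all test images.

    estimated, truth -- lists of 0/1 status vectors, one vector per test image
    p_f -- fraction of bits estimated as 0 (disabled) that are actually 1 (normal)
    p_d -- fraction of bits estimated as 1 (normal) that are actually 0 (disabled)
    """
    bits = [(e, t)
            for s_est, s_true in zip(estimated, truth)
            for e, t in zip(s_est, s_true)]
    false_alarm = sum(1 for e, t in bits if e == 0 and t == 1)
    false_dismissal = sum(1 for e, t in bits if e == 1 and t == 0)
    return false_alarm / len(bits), false_dismissal / len(bits)
```

If the rates were instead meant to be conditional on the true status of the bit, only the two denominators would change.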

From Table 3, we can see that both $p_f$ and $p_d$ tend to increase as the
number of disabled parts increases (the only exception is $p_d$, which
dips from 0.67% to 0.50% between the one-part-shaded and two-part-shaded datasets).
That is because when more
parts are occluded, it is more difficult to obtain valid results in
Step 1 of Algorithm 1 (Table 1), and the number of valid two-part
sub-models available in Step 2 of Algorithm 1
(Table 1) is also reduced. Table 3 also shows that the average distance error increases
as the number of disabled parts increases. Besides the reasons
given above, this is also because the positions of the disabled parts are
estimated only from their spatial relationship with the normal ones.
The experimental results in Table 3 demonstrate that our method
remains competent for object detection even when most parts are
occluded.



Table 3. The results of the complete shading experiment on the
Faces dataset.

Test set                          p_f      p_d      Average e (pixel)
One-part-shaded test datasets     0.13%    0.67%    2.4057
Two-part-shaded test datasets     0.42%    0.50%    2.6009
Three-part-shaded test datasets   2.78%    1.11%    3.8830
Four-part-shaded test datasets    5.33%    1.83%    5.8602
doi:10.1371/journal.pone.0084624.t003